A Fast Heuristic for Approximate String Matching

Authors

  • Ricardo Baeza-Yates
  • Gonzalo Navarro
Abstract

We study a fast algorithm for on-line approximate string matching. It is based on a non-deterministic finite automaton, which is simulated using bit-parallelism. If the automaton does not fit in a computer word, we partition the problem into subproblems. We show experimentally that this algorithm is the fastest for typical text search. We also show which algorithms are the best in other cases, and derive the fastest known heuristic for on-line approximate string matching, when the pattern is not very large. The focus of this work is mainly practical.
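To make the idea of simulating a non-deterministic automaton with bit-parallelism concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it uses the simpler, well-known row-wise Shift-And scheme with k errors (in the style of Wu and Manber), where each error level is kept in one bit vector and updated with a handful of shifts, ANDs, and ORs per text character. The function name `approx_find` and its parameters are assumptions made for illustration. Python integers are unbounded, so the step of partitioning the problem when the automaton does not fit in a computer word is not shown.

```python
from typing import Dict, List


def approx_find(pattern: str, text: str, k: int) -> List[int]:
    """Return the 0-based end positions in `text` where `pattern`
    matches with at most `k` edits (insert, delete, substitute).

    Illustrative sketch of row-wise bit-parallel NFA simulation;
    the name and signature are hypothetical.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    if k < 0:
        raise ValueError("k must be non-negative")

    full = (1 << m) - 1      # keep only the m state bits
    accept = 1 << (m - 1)    # bit of the final automaton state

    # masks[c] has bit i set iff pattern[i] == c
    masks: Dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # rows[d] bit i is set iff pattern[:i+1] matches a suffix of the
    # text read so far with at most d errors. Before any text is read,
    # a prefix of length i+1 can only be matched by i+1 deletions.
    rows = [((1 << d) - 1) & full for d in range(k + 1)]

    ends: List[int] = []
    for j, ch in enumerate(text):
        b = masks.get(ch, 0)
        prev_old = rows[0]                       # rows[d-1] before update
        rows[0] = ((rows[0] << 1) | 1) & b       # exact-match transitions
        for d in range(1, k + 1):
            cur_old = rows[d]
            rows[d] = (
                (((cur_old << 1) | 1) & b)       # match
                | prev_old                       # extra character in text
                | (((prev_old | rows[d - 1]) << 1) | 1)  # substitution, or
                                                 # skipped pattern character
            ) & full
            prev_old = cur_old
        if rows[k] & accept:
            ends.append(j)
    return ends
```

For example, `approx_find("survey", "surgery", 2)` reports every end position at which some substring of the text is within two edits of the pattern. Each text character costs O(k) word operations as long as the pattern fits in one machine word, which is what makes this family of methods attractive for short patterns.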

Similar resources

A Fast Heuristic for Exact String Matching

Given a pattern string P of length n consisting of δ distinct characters and a query string T of length m, where the characters of P and T are drawn from an alphabet Σ of size ∆, the exact string matching problem consists of finding all occurrences of P in T. For this problem, we present a randomized heuristic that in O(nδ) time preprocesses P to identify sparse(P), a rarely occurring substri...

A Comparison of String Metrics for Matching Names and Records

We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We ...

A Comparison of String Distance Metrics for Name-Matching Tasks

Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid s...

GRASPm: an efficient algorithm for exact pattern-matching in genomic sequences

In this paper, we propose Genomic-oriented Rapid Algorithm for String Pattern-match (GRASPm), an algorithm centred on overlapped 2-grams analysis, which introduces a novel filtering heuristic - the compatibility rule - achieving significant efficiency gain. GRASPm's foundations rely especially on a wide searching window having the central duplet as reference for fast filtering of multiple align...

Occurrence and Substring Heuristics for δ-Matching

We consider a version of pattern matching useful in processing large musical data: δ-matching, which consists in finding matches which are δ-approximate in the sense of the distance measured as maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a − b|. We also consider (δ, γ)-matching, where γ is a bound on the total sum of the diff...

Publication date: 1996